14 research outputs found
Reaching the Edge of the Edge: Image Analysis in Space
Satellites have become more widely available as the size and cost of their
components have decreased. As a result, smaller organizations can now deploy
satellites running a variety of data-intensive applications. One popular
application is image analysis for Earth observation, for example to detect
land, ice, or clouds. However, the resource-constrained nature of the devices
deployed in satellites creates additional challenges for this
resource-intensive application.
In this paper, we present our work and lessons learned in building an Image
Processing Unit (IPU) for a satellite. We first investigate the performance of
a variety of edge devices (comparing CPU, GPU, TPU, and VPU) for
deep-learning-based image processing on satellites. Our goal is to identify
devices that achieve accurate results and remain flexible as workloads change,
while satisfying the power and latency constraints of satellites. Our results
demonstrate that hardware accelerators such as ASICs and GPUs are essential
for meeting the latency requirements. However, state-of-the-art edge devices
with GPUs may draw too much power for deployment on a satellite. We then use
the findings from the performance analysis to guide the development of the IPU
module for an upcoming satellite mission. We detail how to integrate such a
module into an existing satellite architecture and the software necessary to
support various missions utilizing this module.
Scalable and dynamically balanced shared-everything OLTP with physiological partitioning
Scaling the performance of shared-everything transaction processing systems to highly parallel multicore hardware remains a challenge for database system designers. Recent proposals alleviate locking and logging bottlenecks in the system, leaving page latching as the next potential problem. To tackle the page latching problem, we propose physiological partitioning (PLP). PLP applies logical-only partitioning, maintaining the desired properties of shared-everything designs, and introduces a multi-rooted B+Tree index structure (MRBTree) that enables partitioning of accesses at the physical page level. Logical partitioning and MRBTrees together ensure that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free; an extended design makes heap page accesses thread-private as well. Moreover, MRBTrees offer an infrastructure for easy repartitioning and allow us to build a lightweight dynamic load balancing mechanism (DLB) on top of PLP. Profiling a PLP prototype running on different multicore machines shows that it acquires 85% and 68% fewer contentious critical sections, respectively, than an optimized conventional design and one based on logical-only partitioning. PLP also improves performance by up to almost 50% over the existing systems, while DLB enhances the system with rapid and robust behavior in both detecting and handling load imbalance.
An Analysis of Collocation on GPUs for Deep Learning Training
Deep learning training is an expensive process that extensively uses GPUs,
but not all model training saturates modern powerful GPUs. Multi-Instance GPU
(MIG) is a new technology introduced by NVIDIA that can partition a GPU to
better fit workloads that do not require all the memory and compute resources
of a full GPU. In this paper, we examine the performance of a MIG-enabled A100
GPU under deep learning workloads of various sizes and combinations of models.
We contrast the benefits of MIG with older workload collocation methods on
GPUs: naïvely submitting multiple processes on the same GPU and utilizing
Multi-Process Service (MPS). Our results demonstrate that collocating multiple
model training runs may yield significant benefits: in certain cases, it can
increase training throughput up to fourfold despite increased epoch time. On
the other hand, the aggregate memory footprint and compute needs of the models
trained in parallel must fit within the available memory and compute resources
of the GPU. MIG can be beneficial thanks to its interference-free
partitioning, especially when the sizes of the models align with the MIG
partitioning options. MIG's rigid partitioning, however, may create
sub-optimal GPU utilization for more dynamic mixed workloads. In general, we
recommend MPS as the best-performing and most flexible form of collocation for
model training for a single user submitting training jobs.
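The three collocation modes the abstract compares can be sketched with standard NVIDIA tooling; this is an illustrative sketch only, where `train.py` and its flags are hypothetical placeholders and the MIG profile IDs assume an A100-40GB:

```shell
# Naive collocation: launch two training processes on the same GPU with no
# coordination; the driver time-slices their kernels.
CUDA_VISIBLE_DEVICES=0 python train.py --model resnet50 &
CUDA_VISIBLE_DEVICES=0 python train.py --model bert &

# MPS collocation: start the MPS control daemon so kernels from multiple
# processes can share the GPU's SMs concurrently instead of time-slicing.
nvidia-cuda-mps-control -d

# MIG collocation: enable MIG mode on GPU 0 and carve it into two isolated
# 3g.20gb instances (profile ID 9 on an A100-40GB), each with its own memory
# and compute slice, so the two jobs cannot interfere with each other.
nvidia-smi -i 0 -mig 1
nvidia-smi mig -cgi 9,9 -C
```

The trade-off the paper reports follows directly from these modes: MPS shares resources flexibly but without isolation, while MIG's fixed instance profiles provide interference-free partitions at the cost of rigidity when model sizes do not match the available profiles.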
OLTP on Hardware Islands
Modern hardware is abundantly parallel and increasingly heterogeneous. The
numerous processing cores have non-uniform access latencies to the main memory
and to the processor caches, which causes variability in the communication
costs. Unfortunately, database systems mostly assume that all processing cores
are the same and that microarchitecture differences are not significant enough
to appear in critical database execution paths. As we demonstrate in this
paper, however, hardware heterogeneity does appear in the critical path and
conventional database architectures achieve suboptimal and, even worse,
unpredictable performance. We perform a detailed performance analysis of OLTP
deployments in servers with multiple cores per CPU (multicore) and multiple
CPUs per server (multisocket). We compare different database deployment
strategies where we vary the number and size of independent database instances
running on a single server, from a single shared-everything instance to
fine-grained shared-nothing configurations. We quantify the impact of
non-uniform hardware on various deployments by (a) examining how efficiently
each deployment uses the available hardware resources and (b) measuring the
impact of distributed transactions and skewed requests on different workloads.
Finally, we argue in favor of shared-nothing deployments that are topology- and
workload-aware and take advantage of fast on-chip communication between islands
of cores on the same socket.
Comment: VLDB201
Toward Scalable Transaction Processing: Evolution of Shore-MT
Designing scalable transaction processing systems on modern multicore hardware has been a challenge for almost a decade. The typical characteristics of transaction processing workloads lead to a high degree of unbounded communication on multicores for conventional system designs. In this tutorial, we initially present a systematic way of eliminating scalability bottlenecks of a transaction processing system, which is based on minimizing unbounded communication. Then, we show several techniques that apply the presented methodology to minimize the logging-, locking-, and latching-related bottlenecks of transaction processing systems. In parallel, we demonstrate the internals of the Shore-MT storage manager and how they have evolved over the years in terms of scalability on multicore hardware through such techniques. We also teach how to use Shore-MT with the various design options it offers through its sophisticated application layer Shore-Kits and simple Metadata Frontend.
Reviving the Workshop Series on Testing Database Systems – DBTest
With the ever-increasing complexity of database systems and their pervasive use in industry, testing them has long been an important issue. Recognizing this relevance, researchers and industry started the Workshop Series on Testing Database Systems in 2008, collocated with ACM SIGMOD. Six instances of the workshop ran successfully until 2013. Five years later, in 2018, we revived the workshop in a new, biannual format. Today, the DBTest workshop consistently attracts high-quality submissions, expert presenters, and active participants from both academia and industry. Going forward, we plan to open the workshop up to an even more diverse audience, especially the research communities that focus on software testing and debugging in general, and not only on database systems.
From A to E: Analyzing TPC’s OLTP benchmarks: the obsolete, the ubiquitous, the unexplored
Introduced in 2007, TPC-E is the most recently standardized OLTP benchmark by TPC. Even though TPC-E has already been around for six years, it has not gained the popularity of its predecessor TPC-C: all the published results for TPC-E use a single database vendor's product. TPC-E is significantly different from its predecessors. Some of its distinguishing characteristics are non-uniform input creation, longer-running and more complicated transactions, and more difficult partitioning. These factors slow down the adoption of TPC-E. In turn, there is little knowledge in the community about how TPC-E behaves micro-architecturally and within the database engine. To shed light on TPC-E, we implement it on top of a scalable open-source database engine, Shore-MT, and perform a workload characterization study, comparing it with the previous, much better known OLTP benchmarks of TPC: TPC-B and TPC-C. In parallel, we study the evolution of the OLTP benchmarks throughout the decades. Our results demonstrate that TPC-E exhibits similar micro-architectural behavior to TPC-B and TPC-C, even though it incurs less stall time and higher instructions per cycle. On the other hand, within the database engine it suffers more from logical lock contention. Therefore, we argue that, on the hardware side, TPC-E needs less aggressive processors, whereas on the software side it can benefit from designs based on intra-transaction parallelism, logical partitioning, and optimistic concurrency control to minimize the effects of lock contention without introducing distributed transactions.